WHO AM I AND WHERE DID THIS TALK COME FROM?

Ph.D. Student at Deakin University. 



Research interests include:

• Automated vulnerability discovery.
• Software similarity and classification.
• Malware detection.



This presentation is based on my malware research.

Outline

1. Introduction (you might already know this)

2. New approaches to flowgraph-based classification

3. Evaluation

4. Other things we use our system on

5. Conclusion

Introduction

o Malware is a significant problem.

o Static detection of malware is [...] technique.

o Detecting unknown variants [...]

Signatures and birthmarks

A birthmark is an invariant property in related samples.

Birthmark comparison should allow inexact matching.

Limitations of existing birthmarks

Byte-level content can change in every variant. 

Comparing birthmarks is often exact matching only.

Inefficient for inexact database searching.

Unable to detect unknown variants of known samples.

o Program structure is a better birthmark.



The software similarity problem 



[Figure: Program A and its fingerprint; Program B and its fingerprint.]

The software similarity search

Need a dissimilarity or distance metric.

o "Metric" property allows efficient database search.

[Figure: database search by distance d(p, q); labels: "Query Benign", "Query Malicious"; legend: Query, Malware.]
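Why the metric property helps a database search can be seen in a minimal sketch. It assumes edit distance as the metric and a single pivot; the names `build_index` and `range_search` are hypothetical and are not taken from the system described in this talk. The triangle inequality gives a lower bound on d(q, x) from distances that are already known, so many database entries can be skipped without computing their distance to the query.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a true metric on strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def build_index(database, pivot):
    """Precompute d(pivot, x) once for every database entry x."""
    return [(x, edit_distance(pivot, x)) for x in database]


def range_search(index, pivot, query, radius):
    """Return (entries within radius of query, number of distances computed)."""
    d_query_pivot = edit_distance(query, pivot)
    computed = 1
    matches = []
    for x, d_pivot_x in index:
        # Triangle inequality: d(q, x) >= |d(q, p) - d(p, x)|.
        # If that lower bound already exceeds the radius, skip x.
        if abs(d_query_pivot - d_pivot_x) > radius:
            continue
        computed += 1
        if edit_distance(query, x) <= radius:
            matches.append(x)
    return matches, computed


# Hypothetical toy database; real entries would be program birthmarks.
database = ["hello", "hallo", "help", "world", "malware", "benign"]
index = build_index(database, pivot="hello")
matches, computed = range_search(index, "hello", "hullo", radius=1)
print(matches, computed)
```

Metric trees apply the same pruning idea recursively with many pivots, which is what makes them suitable for inexact search over large databases.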



Existing approaches: A call graph birthmark

o Inter-procedural control flow.

An optimal dissimilarity metric for graphs

o Graph edit distance.

o Number of operations to transform one graph to another.

o Computing it exactly is NP-hard.

Non-optimal solutions possible in cubic time.

Our approach: A set of control flow graphs birthmark

Intra-procedural control flow.
Many procedures.

Transforming graph dissimilarity to a string dissimilarity problem

o Decompile control flow graphs to strings.
o Compare strings using 'string metrics'.

proc() {
L_0:
    while (v1 || v2) {
L_1:
        if (v3) {
L_2:
        } else {
L_4:
        }
L_5:
    }
L_7:
    return;
}
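To make the idea of decompiling a control flow graph to a string concrete, here is a minimal sketch that encodes the structured form of the proc example above. The grammar is an assumption made for illustration only (B for a basic block, W{...} for a while loop, I{...}E{...} for if/else, R for return); the talk does not spell out the actual grammar, and the tuple representation and `encode` function are hypothetical.

```python
def encode(node):
    """Encode a structured control flow tree as a string (hypothetical grammar)."""
    kind = node[0]
    if kind == "block":
        return "B"
    if kind == "return":
        return "R"
    if kind == "seq":
        return "".join(encode(child) for child in node[1])
    if kind == "while":
        return "W{" + encode(node[1]) + "}"
    if kind == "if":
        return "I{" + encode(node[1]) + "}E{" + encode(node[2]) + "}"
    raise ValueError(f"unknown node kind: {kind!r}")


BLOCK = ("block",)

# Structured form of the proc() example; labels refer to the listing above.
proc = ("seq", [
    BLOCK,                      # L_0
    ("while", ("seq", [
        BLOCK,                  # L_1
        ("if", BLOCK, BLOCK),   # L_2 (then) and L_4 (else)
        BLOCK,                  # L_5
    ])),
    ("return",),                # L_7
])

signature = encode(proc)
print(signature)  # BW{BI{B}E{B}B}R
```

Two procedures with the same control structure produce the same string regardless of their byte-level content, and small structural changes produce strings that are close under a string metric.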

Transforming a set of strings problem into a string problem

o Decompiled CFGs give us a set of strings.
o Order and concatenate strings.
o Delimit substrings with 'Z'.

o Order based on metrics:

Number of instructions in procedure.

Number of basic blocks.

etc.
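The ordering and concatenation step can be sketched as follows. The `Procedure` record, the ascending sort direction, and the tie-breaking on the string itself are assumptions made for illustration; the slide only states that procedures are ordered by metrics such as instruction count and basic block count, and that substrings are delimited with 'Z'.

```python
from typing import NamedTuple


class Procedure(NamedTuple):
    cfg_string: str        # decompiled control flow graph as a string
    num_instructions: int
    num_basic_blocks: int


def program_string(procedures, delimiter="Z"):
    """Order per-procedure strings deterministically and join them."""
    ordered = sorted(
        procedures,
        key=lambda p: (p.num_instructions, p.num_basic_blocks, p.cfg_string),
    )
    return delimiter.join(p.cfg_string for p in ordered)


# Hypothetical procedures with made-up sizes.
procs = [
    Procedure("BW{BI{B}E{B}B}R", 42, 6),
    Procedure("BR", 3, 1),
    Procedure("BI{B}E{B}R", 17, 4),
]
combined = program_string(procs)
print(combined)  # BRZBI{B}E{B}RZBW{BI{B}E{B}B}R
```

The point of ordering by stable metrics is that two variants with the same procedures yield the same combined string even if the procedures appear in a different order in the binary. The delimiter must be a character that never occurs inside a procedure string.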

What we tried (and ended up not using)

o String metrics:

Edit distance -> ed("hello", "ggello") = 2

Normalized Compression Distance ->

$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$$

Sequence alignment ->

A K TKT K
| | | | |
ATKTT T K

o All databases indexed using metric trees.
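The first two string metrics on this slide can be sketched in a few lines. Here C(x) is taken to be the length of x after compression; zlib is an assumed stand-in for the compressor, since the talk does not say which one was used, and the function names are hypothetical.

```python
import zlib


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def compressed_size(data: bytes) -> int:
    """C(x): length of x after compression (zlib is an assumption here)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance as given on the slide."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


print(edit_distance("hello", "ggello"))  # 2: substitute h -> g, insert g
```

NCD is close to 0 when one input adds little new information to the other and close to 1 when the inputs share nothing the compressor can exploit. With a real compressor the values are approximate, especially on short inputs.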

Sequence alignment with BLAST

o A heuristic genome sequence search tool.

Local sequence alignment.

Hugely popular in bioinformatics.

o So, transform our strings into genome sequences.

o Then, do a genome search.

Genome sequence extraction

-> ACGTRYKMACGTRYKM

A = Adenine
C = Cytosine
G = Guanine
T = Thymine

Why didn't we use those approaches?

o Not optimally effective. 



o Too slow. 



Best speed was using NCD. 



A DISSIMILARITY METRIC FOR SETS OF STRINGS (WHAT WE ENDED UP USING)

Find a mapping between strings to minimize the sum of distances.

Combinatorial optimisation: The assignment problem

o Finding a minimum cost mapping is known as the "assignment problem".

Optimal solutions exist in cubic time.

"Greedy" heuristic solutions are faster.

o Has the properties of a metric.
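A minimal sketch of the set-of-strings distance follows, under stated assumptions: edit distance is the per-string cost, unequal set sizes are padded with empty strings, and the optimal mapping is found by brute force over permutations, which is only practical for tiny sets. A cubic-time solver such as the Hungarian algorithm would replace the brute force in practice. All function names are hypothetical and this is not the code used in the system described here.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cost_matrix(xs, ys):
    """Square matrix of pairwise distances; the smaller set is padded with ''."""
    n = max(len(xs), len(ys))
    xs = list(xs) + [""] * (n - len(xs))
    ys = list(ys) + [""] * (n - len(ys))
    return [[edit_distance(x, y) for y in ys] for x in xs]


def optimal_cost(cost):
    """Minimum total cost over all one-to-one mappings (brute force, tiny n only)."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n)) for p in permutations(range(n)))


def greedy_cost(cost):
    """Greedy heuristic: repeatedly take the cheapest pair that is still free."""
    n = len(cost)
    pairs = sorted((cost[i][j], i, j) for i in range(n) for j in range(n))
    used_rows, used_cols, total = set(), set(), 0
    for c, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += c
    return total


# Hypothetical procedure strings for two program variants.
variant_a = ["BW{B}R", "BI{B}E{B}R", "BR"]
variant_b = ["BR", "BW{BB}R", "BI{B}E{B}BR"]
cost = cost_matrix(variant_a, variant_b)
print(optimal_cost(cost), greedy_cost(cost))
```

The greedy total is never lower than the optimal total, and the two can differ when an early cheap pairing blocks a better overall mapping; that is the trade-off between speed and accuracy.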

Implementation

Malwise system is 100,000 lines of C code.

The modules for this work are fewer than 3,000 lines of code.

o Unpacks malware using an application level emulator (Ruxcon 2010).

Pre-filtering stage to quickly cull non-matching variants (Ruxcon 2011).



Evaluation - Effectiveness 

Calculated similarity between Roron malware 
variants. 



Compared results to Ruxcon 2010 work. 

In the tables, a highlighted cell indicates a positive match.

The more matches, the more effective the approach.

[Tables not recovered. Captions: Exact Matching (Ruxcon 2010); Assignment problem; Heuristic Approximate Matching (Ruxcon 2010).]



Evaluation - False Positives 

o Database of 10,000 malware samples.

o Scanned 1,601 benign binaries.

o 7 false positives (about 0.44%), less than 1%.

Very small binaries have small signatures and cause weak matching.

Evaluation - Efficiency

Median processing time is 0.06 s for benign samples and 0.84 s for malware.

SIMSEER - A SOFTWARE SIMILARITY WEB SERVICE

• An online service to identify similarity between programs.

• Based on Malwise.

• Renders an evolutionary tree to show program relationships.

• Free to use!

• http://www.foocodechu.com/?q=simseer-a-software-similarity-web-service

Simseer - Demo

o http://www.youtube.com/watch?v=ymo7DKIKCH

BUGWISE

o Automatically detect bugs and vulnerabilities in Linux executable binaries.

o Uses static program analysis from Malwise:
  Decompilation
  Data Flow Analysis

o Free to use!

o http://www.foocodechu.com/?q=bugwise-a-bug-detection-web-service-for-binary-executables




BUGWISE - SGID GAMES XONIX BUG IN DEBIAN LINUX

memset (score_rec[i].login, 0, 11);
strncpy (score_rec[i].login, pw->pw_name, 10);
memset (score_rec[i].full, 0, 65);
strncpy (score_rec[i].full, fullname, 64);
score_rec[i].tstamp = time (NULL);
free (fullname);                /* fullname is freed here ... */
if ((high = freopen (PATH_HIGHSCORE, "w", high)) == NULL) {
    fprintf (stderr, "xonix: cannot reopen high score file\n");
    free (fullname);            /* ... and freed again on the error path: double free */
    gameover_pending = 0;
    return;

Publications

o Book published by Springer.

o http://www.springer.com/computer/security+and+cryptology/book/978-1-4471-2908-0




Conclusion 

o Malwise effectively identifies malware variants.

o Runs in real-time in expected case.

Large functional code base and years of development time.

Happy to talk to vendors.

o http://www.FooCodeChu.com